Abstract: We study the fundamental problem of optimal
power allocation over two identical Gilbert-Elliott (binary
Markov) communication channels. Our goal is to maximize the
expected discounted number of bits transmitted over an infinite
time span by judiciously choosing one of four actions in
each time slot: 1) allocating power equally to both channels,
2) allocating all the power to channel 1, 3) allocating all the
power to channel 2, and 4) allocating no power to either
channel. As the channel state is unknown when the power
allocation decision is made, we model this problem as a partially
observable Markov decision process (POMDP) and derive the
optimal policy, which gives the optimal action to take under
each possible belief about the channel states. Two different
structures of the optimal policy are derived analytically and
verified by linear programming simulation. We also illustrate how
to construct the optimal policy by combining threshold calculation
and linear programming simulation once the system parameters are
known.

I. Introduction 

Adaptive power control is an important technique that selects
the transmission power of a wireless system according to the
channel condition in order to achieve better network performance
in terms of higher data rate or spectrum efficiency [1], [2]. There
has been some recent work on power allocation over stochastic
channels [3], [4], [5]; however, the problem of optimal power
allocation across multiple dynamic stochastic channels is
challenging and remains largely unsolved from a theoretical
perspective.

We consider a wireless system operating on two parallel 
transmission channels. The two channels are statistically iden- 
tical and independent of each other. We model each channel 
as a slotted Gilbert-Elliott channel. That is, each channel 
is described by a two-state Markov chain, with a bad state 
"0" and a good state "1" [7]. Our objective is to allocate 
the limited power budget to the two channels dynamically 
so as to maximize the expected discounted number of bits 
transmitted over time. Since the channel state is unknown
when the decision is made, this problem is more challenging
than it first appears.

Recently, several works have explored different sequential
decision-making problems involving Gilbert-Elliott channels.
In [8], [9], the authors consider the problem of selecting one
channel to sense/access among several identical channels,
formulate it as a restless multi-armed bandit problem, and show
that a simple myopic policy is optimal whenever the channels
are positively correlated over time. In [6], the authors study
the problem of dynamically choosing one of three transmitting
schemes for a single Gilbert-Elliott channel in an attempt to
maximize the expected discounted number of bits transmitted.
In [10], the authors study the problem of choosing a
transmitting strategy from two choices, emphasizing the case
when the channel transition probabilities are unknown. While
similar in spirit to these two studies, our work addresses a
more challenging setting involving two independent channels.
In [6], [8], [9], only one channel is accessed in each time
slot, while our formulation of power allocation allows both
channels to be used simultaneously. In [17], a similar power
allocation problem is studied. Our work differs from [17] in
two respects: four power allocation actions are considered
instead of three, and a penalty is introduced when power is
allocated to a channel in bad condition. With the introduction
of one more action (using neither of the two channels) and the
transmission penalty, the problem becomes more interesting yet
more difficult to analyze.

In this paper, we formulate our power allocation problem as
a partially observable Markov decision process (POMDP). We
then convert it to a continuous-state Markov decision process
(MDP) and derive the structure of the optimal policy.
Our main contributions are: (1) we formulate the problem using
MDP theory and theoretically prove the structure of the
optimal policy, (2) we verify our analysis through simulation
based on linear programming, and (3) we demonstrate how to
numerically obtain the structure of this optimal policy when
the system parameters are known.

The results in this paper advance the fundamental under- 
standing of optimal power allocation over multiple dynamic 
stochastic channels from a theoretical perspective. 

II. Problem Formulation 

A. Channel model and assumptions 

We consider a wireless communication system operating on
two parallel channels. Each channel is described by a slotted
Gilbert-Elliott model, which is a one-dimensional two-state
Markov chain $G_{i,t}$ ($i \in \{1,2\}$, $t \in \{1,2,\ldots,\infty\}$)
with a good state denoted by 1 and a bad state denoted by 0
($i$ is the channel index and $t$ is the time slot). The state
transition probabilities are
$\Pr[G_{i,t} = 1 \mid G_{i,t-1} = 1] = \lambda_1$ and
$\Pr[G_{i,t} = 1 \mid G_{i,t-1} = 0] = \lambda_0$, $i \in \{1,2\}$.
We assume the two channels are identical and independent of each
other, and that channel state transitions occur only at the
beginning of each time slot. We also assume that
$\lambda_0 < \lambda_1$, which is a positive correlation assumption
commonly used in the literature.

The system has a total power $P$. At the beginning of each
slot, the system allocates power $P_1(t)$ to channel 1 and power
$P_2(t)$ to channel 2, where $P_1(t) + P_2(t) \le P$. We assume
the channel state is unknown at the beginning of each time slot,
so the system must decide the power allocation for the
two channels without knowing the channel states. If a channel
is used in slot $t$, its state during that slot is revealed
at the end of slot $t$ through channel feedback. If
a channel is not used, its state during the elapsed time slot
remains unknown.
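
As an illustration of the channel model above, the following minimal sketch simulates one slotted Gilbert-Elliott channel. The function names, the seed, and the values $\lambda_0 = 0.1$, $\lambda_1 = 0.9$ are assumptions chosen for illustration; the stationary probability is the standard result for a two-state Markov chain, not a formula stated in this paper.

```python
import random


def simulate_gilbert_elliott(lam0, lam1, steps, state=1, seed=0):
    """Return a list of channel states (1 = good, 0 = bad), one per slot."""
    rng = random.Random(seed)
    states = []
    for _ in range(steps):
        # The state transition happens at the beginning of each slot:
        # good -> good with prob. lam1, bad -> good with prob. lam0.
        p_good = lam1 if state == 1 else lam0
        state = 1 if rng.random() < p_good else 0
        states.append(state)
    return states


def stationary_good_probability(lam0, lam1):
    """Long-run fraction of good slots, lam0 / (1 - (lam1 - lam0)).

    Undefined for the degenerate chain lam0 = 0, lam1 = 1.
    """
    return lam0 / (1.0 - (lam1 - lam0))


if __name__ == "__main__":
    trace = simulate_gilbert_elliott(0.1, 0.9, steps=200_000)
    print(sum(trace) / len(trace), stationary_good_probability(0.1, 0.9))
```

With $\lambda_0 < \lambda_1$ the simulated trace shows long runs of good and bad slots, which is the positive correlation the policy later exploits.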

B. Power allocation strategies 

To simplify the power allocation problem, we define three
power levels the system may allocate to a channel: $0$, $P/2$, $P$.
If a channel in the good state is allocated power $P/2$, it can
transmit $R_l$ bits of data during that slot. If a channel in the
good state is allocated power $P$, it can transmit $R_h$ bits of
data successfully. We assume $R_l < R_h < 2R_l$. At the same time,
if a channel in the bad state is allocated power $P/2$, it suffers
$C_l$ bits of data loss. If a channel in the bad state is allocated
power $P$, it suffers $C_h$ bits of data loss. We assume
$C_l < C_h < 2C_l$ and $R_h > C_h$, $R_l > C_l$.

At the beginning of each time slot, the system chooses one
of the following four actions: balanced, betting on channel 1,
betting on channel 2, and conservative.

Balanced (denoted by $B_b$): the system allocates power
evenly to both channels, that is, $P_1(t) = P_2(t) = P/2$ for
time slot $t$. This action is chosen when the system believes
both channels are in the good state and it is most beneficial to
use both of them.

Betting on channel 1 (denoted by $B_1$): the system decides
to "gamble" by allocating all the power to channel 1, that is,
$P_1(t) = P$, $P_2(t) = 0$. This occurs when the system believes
that channel 1 will be in the good state and channel 2 will be in
the bad state.

Betting on channel 2 (denoted by $B_2$): contrary to $B_1$, the
system allocates all the power to channel 2, that is,
$P_1(t) = 0$, $P_2(t) = P$.

Conservative (denoted by $B_r$): the system decides to "play
safe" by using neither of the two channels, that is,
$P_1(t) = P_2(t) = 0$. This action is taken when the system
believes both channels will be in the bad state and using either
channel will cause data loss.

Note that in actions $B_1$, $B_2$ and $B_r$, if a channel is not
used, the system will not know its state in the elapsed slot.

C. Formulation of the Partially Observable Markov Decision
Problem

At the beginning of each time slot, the system needs to
judiciously choose one of the four power allocation actions to
maximize the total discounted number of data bits transmitted
over an infinite time span. Since the channel state is not
observable when the choice is made, this power allocation
problem is a Partially Observable Markov Decision Problem
(POMDP). It is shown in [11] that a sufficient statistic for
determining the optimal action is the conditional probability that
the channel is in the good state at the beginning of the current
slot given the past history; henceforth this conditional
probability is called the belief. We denote the belief by a
two-dimensional vector $x_t = (x_{1,t}, x_{2,t})$, where
$x_{i,t} = \Pr[G_{i,t} = 1 \mid h_t]$, $i \in \{1,2\}$, and $h_t$
is the history of all actions and state observations prior to the
beginning of the current slot. Using the belief as the decision
variable, the POMDP is converted into an MDP with
an uncountable state space $([0,1],[0,1])$ [8].

Define a policy $\pi$ as a rule that determines the action to
take under different situations, that is, a mapping from the
belief space to the action space. Let $V^{\pi}(p)$ denote the
expected discounted reward with initial belief $p = (p_1, p_2)$,
that is, $x_{1,0} = \Pr[G_{1,0} = 1 \mid h_0] = p_1$ and
$x_{2,0} = \Pr[G_{2,0} = 1 \mid h_0] = p_2$, with $\pi$ denoting the
policy followed. With discount factor $\beta \in [0,1)$, the
expected discounted reward is expressed as

$$V^{\pi}(p) = E_{\pi}\Big[\sum_{t=0}^{\infty} \beta^{t} g_{a_t}(x_t) \,\Big|\, x_0 = p\Big], \tag{1}$$

where $E_{\pi}$ denotes the expectation given policy $\pi$, $t$ is
the time slot index, and $a_t \in \{B_1, B_2, B_b, B_r\}$ represents
the action taken at time $t$. The term $g_{a_t}(x_t)$ denotes the
expected immediate reward when the belief is $x_t$ and action $a_t$
is chosen:

$$g_{a_t}(x_t) = \begin{cases}
x_{1,t}(R_h + C_h) - C_h & \text{if } a_t = B_1 \\
x_{2,t}(R_h + C_h) - C_h & \text{if } a_t = B_2 \\
(x_{1,t} + x_{2,t})(R_l + C_l) - 2C_l & \text{if } a_t = B_b \\
0 & \text{if } a_t = B_r.
\end{cases} \tag{2}$$
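
The expected immediate reward of each action can be written down directly from the reward and loss definitions of Section II-B. The sketch below is a minimal illustration; the action labels as strings, the helper `myopic_action`, and the numeric values $R_h = 3$, $R_l = 2$, $C_h = 1.2$, $C_l = 0.8$ are assumptions made for the example.

```python
ACTIONS = ("Bb", "B1", "B2", "Br")


def immediate_reward(action, x1, x2, Rh, Rl, Ch, Cl):
    """Expected immediate reward for belief (x1, x2) under the given action."""
    if action == "B1":   # all power on channel 1
        return x1 * (Rh + Ch) - Ch
    if action == "B2":   # all power on channel 2
        return x2 * (Rh + Ch) - Ch
    if action == "Bb":   # half power on each channel
        return (x1 + x2) * (Rl + Cl) - 2 * Cl
    if action == "Br":   # no transmission, no reward and no loss
        return 0.0
    raise ValueError(f"unknown action: {action}")


def myopic_action(x1, x2, **params):
    """Action maximizing the immediate reward only (ignores the future)."""
    return max(ACTIONS, key=lambda a: immediate_reward(a, x1, x2, **params))


if __name__ == "__main__":
    params = dict(Rh=3.0, Rl=2.0, Ch=1.2, Cl=0.8)
    for belief in [(0.1, 0.1), (0.9, 0.1), (0.5, 0.5)]:
        print(belief, myopic_action(*belief, **params))
```

The myopic rule is only a baseline: the optimal policy also accounts for the information gained by observing the channels that are used.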



Now we define the value function $V(p)$ as

$$V(p) = \max_{\pi} V^{\pi}(p) \quad \forall p \in ([0,1],[0,1]). \tag{3}$$

A policy is stationary if it is a function mapping the state space
$([0,1],[0,1])$ into the action space $\{B_1, B_2, B_b, B_r\}$. Ross
proved in [12] (Th. 6.3) that there exists a stationary policy
$\pi^*$ such that $V(p) = V^{\pi^*}(p)$, and the value function
$V(p)$ satisfies the Bellman equation

$$V(p) = \max_{a \in \{B_1, B_2, B_b, B_r\}} \{V_a(p)\}, \tag{4}$$

where $V_a(p)$ denotes the value acquired when the belief is $p$
and action $a$ is taken. $V_a(p)$ is given by

$$V_a(p) = g_a(p) + \beta E_y[V(y) \mid x_0 = p, a_0 = a], \tag{5}$$

where $y$ denotes the next belief after action $a$ is taken when
the initial belief is $p$. $V_a(p)$ for the four actions is derived
as follows.

a) Balanced ($B_b$): If this action is taken with initial belief
$p = (p_1, p_2)$, the immediate reward is $p_1 R_l + p_2 R_l$ and the
immediate loss is $(1 - p_1)C_l + (1 - p_2)C_l$. Since both channels
are used, their states during the current slot are revealed at
the end of the current time slot. Therefore, with probability $p_1$
channel 1 will be in the good state, hence the belief of channel 1
at the beginning of the next slot will be $\lambda_1$. Likewise, with
probability $1 - p_1$ channel 1 will be in the bad state, thus the
belief in the next slot will be $\lambda_0$. Since both channels are
identical, channel 2 has a similar belief update. Consequently, the
value function when action $B_b$ is taken can be expressed as

$$\begin{aligned}
V_{B_b}(p) ={} & (p_1 + p_2)(R_l + C_l) - 2C_l \\
& + \beta[(1 - p_1)(1 - p_2)V(\lambda_0, \lambda_0) + p_1 p_2 V(\lambda_1, \lambda_1) \\
& \quad + p_1(1 - p_2)V(\lambda_1, \lambda_0) + (1 - p_1)p_2 V(\lambda_0, \lambda_1)].
\end{aligned} \tag{6}$$



b) Betting on channel 1 ($B_1$): If this action is taken with
initial belief $p = (p_1, p_2)$, the immediate reward is $p_1 R_h$,
and the immediate loss is $(1 - p_1)C_h$. Since channel 2 is not
used, its channel state in the current slot remains unknown.
Therefore the belief of channel 2 in the next time slot is
calculated as

$$T(p_2) = (1 - p_2)\lambda_0 + p_2\lambda_1 = \alpha p_2 + \lambda_0, \tag{7}$$

where $\alpha = \lambda_1 - \lambda_0$. Consequently, the value
function when action $B_1$ is taken can be expressed as

$$V_{B_1}(p) = (R_h + C_h)p_1 - C_h + \beta[p_1 V(\lambda_1, T(p_2)) + (1 - p_1)V(\lambda_0, T(p_2))]. \tag{8}$$

c) Betting on channel 2 ($B_2$): Similar to action $B_1$, the value
function when action $B_2$ is taken can be expressed as

$$V_{B_2}(p) = (R_h + C_h)p_2 - C_h + \beta[p_2 V(T(p_1), \lambda_1) + (1 - p_2)V(T(p_1), \lambda_0)], \tag{9}$$

where

$$T(p_1) = (1 - p_1)\lambda_0 + p_1\lambda_1 = \alpha p_1 + \lambda_0. \tag{10}$$
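
The belief updates used in the value functions can be collected in one helper that lists every possible next belief together with its probability. This is a minimal sketch with hypothetical function names; it also includes the conservative action treated next, for which both beliefs simply evolve through $T$.

```python
def T(p, lam0, lam1):
    """One-step belief update for a channel that was not observed."""
    return (lam1 - lam0) * p + lam0


def next_beliefs(action, p1, p2, lam0, lam1):
    """Return a list of (probability, (next_p1, next_p2)) pairs."""
    if action == "Bb":
        # Both channels are observed, so each belief resets to lam1 or lam0.
        return [(q1 * q2, (s1, s2))
                for q1, s1 in ((p1, lam1), (1 - p1, lam0))
                for q2, s2 in ((p2, lam1), (1 - p2, lam0))]
    if action == "B1":
        t2 = T(p2, lam0, lam1)  # channel 2 is unobserved
        return [(p1, (lam1, t2)), (1 - p1, (lam0, t2))]
    if action == "B2":
        t1 = T(p1, lam0, lam1)  # channel 1 is unobserved
        return [(p2, (t1, lam1)), (1 - p2, (t1, lam0))]
    if action == "Br":
        # Neither channel is observed: the update is deterministic.
        return [(1.0, (T(p1, lam0, lam1), T(p2, lam0, lam1)))]
    raise ValueError(f"unknown action: {action}")


if __name__ == "__main__":
    for a in ("Bb", "B1", "B2", "Br"):
        print(a, next_beliefs(a, 0.3, 0.5, lam0=0.1, lam1=0.9))
```

Whatever the action, each next belief component is one of $\lambda_0$, $\lambda_1$ or $T(p)$, which is what keeps the value functions tractable.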



d) Conservative ($B_r$): If this action is taken, both the immediate
reward and the immediate loss are 0. Since neither channel is used,
their beliefs at the beginning of the next slot are given by

$$T(p_i) = (1 - p_i)\lambda_0 + p_i\lambda_1 = \alpha p_i + \lambda_0, \quad i \in \{1,2\}. \tag{11}$$

Consequently, the value function when action $B_r$ is taken can
be expressed as

$$V_{B_r}(p) = \beta V(T(p_1), T(p_2)). \tag{12}$$

Finally, the Bellman equation for our power allocation
problem reads as

$$V(p) = \max\{V_{B_b}, V_{B_1}, V_{B_2}, V_{B_r}\}. \tag{13}$$

III. Structure of the Optimal Policy

From the discussion in the previous section, we understand
that an optimal policy exists for our power allocation problem.
In this section, we derive the optimal policy by first
looking at the features of its structure.

A. Properties of value function

Lemma 1: $V_a(p)$, $a \in \{B_b, B_1, B_2, B_r\}$, is affine in both
$p_1$ and $p_2$, and the following equalities hold:

$$\begin{aligned}
V_a(cp + (1 - c)p', p_2) &= cV_a(p, p_2) + (1 - c)V_a(p', p_2) \\
V_a(p_1, cp + (1 - c)p') &= cV_a(p_1, p) + (1 - c)V_a(p_1, p'),
\end{aligned}$$

where $0 \le c \le 1$ is a constant, and $f(x)$ is said to be affine
with respect to $x$ if $f(x) = ax + b$ with constants $a$ and $b$.

Proof: It is clear from (6) that $V_{B_b}$ is affine in $p_1$ and
$p_2$. It is also obvious that $V_{B_1}$ is affine in $p_1$ and
$V_{B_2}$ is affine in $p_2$ from (8) and (9), respectively. Next, we
prove that $V_{B_1}$ is affine in $p_2$.

Consider the right side of equation (8). The first and second terms
do not depend on $p_2$, so this part is affine in $p_2$. For the
third term, the main part $V(c, T(p_2))$ ($c \in \{\lambda_0,
\lambda_1\}$) takes one of the following four forms:
$V_{B_b}(c, T(p_2))$, $V_{B_2}(c, T(p_2))$, $V_{B_1}(c, T(p_2))$ or
$V_{B_r}(c, T(p_2))$. The first form is affine in $p_2$ because
$V_{B_b}(c, T(p_2))$ is affine in $T(p_2)$ and
$T(p_2) = \alpha p_2 + \lambda_0$ is affine in $p_2$. Similarly, the
second form $V_{B_2}(c, T(p_2))$ is affine in $T(p_2)$ and thus also
affine in $p_2$. The latter two forms $V_{B_1}(c, T(p_2))$ and
$V_{B_r}(c, T(p_2))$ can be written as

$$V_{B_1}(c, T(p_2)) = c(R_h + C_h) - C_h + \beta c V(\lambda_1, T^2(p_2)) + \beta(1 - c)V(\lambda_0, T^2(p_2)), \tag{14}$$

$$V_{B_r}(c, T(p_2)) = \beta V(T(c), T^2(p_2)), \tag{15}$$

where $T^n(p) = T^{(n-1)}(T(p)) = \frac{\lambda_0}{1 - \alpha}(1 - \alpha^n) + \alpha^n p$.
Since $T^n(p_2)$ is affine in $p_2$, (14) is affine in $p_2$ as
soon as $V(\lambda_1, T^n(p_2))$ takes the form
$V_{B_b}(\lambda_1, T^n(p_2))$ or $V_{B_2}(\lambda_1, T^n(p_2))$, and
$V(\lambda_0, T^n(p_2))$ takes the form
$V_{B_b}(\lambda_0, T^n(p_2))$ or $V_{B_2}(\lambda_0, T^n(p_2))$,
$n = 2, 3, \ldots$, each of which is affine in $p_2$. If
$V(\lambda_1, T^n(p_2))$ keeps taking the form
$V_{B_1}(\lambda_1, T^n(p_2))$ or $V_{B_r}(\lambda_1, T^n(p_2))$ until
$n$ goes to infinity, it will eventually become
$V_{B_1}(\lambda_1, \frac{\lambda_0}{1 - \alpha})$ or
$V_{B_r}(\lambda_1, \frac{\lambda_0}{1 - \alpha})$, because
$T^n(p_2) \to \frac{\lambda_0}{1 - \alpha}$ when $n \to \infty$, which
is a special case of being affine in $p_2$. The same is true for the
term $V(\lambda_0, T^n(p_2))$. With this we show that (14) is affine
in $p_2$. Similarly we can prove that (15) is affine in $p_2$, thus
(8) is affine in $p_2$.

Using the same technique we can prove that $V_{B_2}(p_1, p_2)$ is
affine in $p_1$ and $V_{B_r}$ is affine in both $p_1$ and $p_2$. With
this we show that $V_a(p)$, $a \in \{B_b, B_1, B_2, B_r\}$, is affine
in both $p_1$ and $p_2$, and the equalities in Lemma 1 immediately
follow. This concludes the proof. ■
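
The closed form of the iterated belief update $T^n$ used in the proof follows from summing a geometric series, and it can be checked numerically. The sketch below is illustrative only; the function names and the values $\lambda_0 = 0.1$, $\lambda_1 = 0.9$ are assumptions.

```python
def T(p, lam0, lam1):
    """One-step belief update for an unobserved channel."""
    return (lam1 - lam0) * p + lam0


def T_iterated(p, n, lam0, lam1):
    """Apply T to p exactly n times."""
    for _ in range(n):
        p = T(p, lam0, lam1)
    return p


def T_closed_form(p, n, lam0, lam1):
    """lam0 / (1 - alpha) * (1 - alpha**n) + alpha**n * p, alpha = lam1 - lam0."""
    alpha = lam1 - lam0
    return lam0 / (1 - alpha) * (1 - alpha ** n) + alpha ** n * p


if __name__ == "__main__":
    lam0, lam1 = 0.1, 0.9
    for n in (1, 2, 5, 50):
        print(n, T_iterated(0.3, n, lam0, lam1), T_closed_form(0.3, n, lam0, lam1))
    # Both columns approach lam0 / (1 - alpha) as n grows.
```

Because $0 < \alpha < 1$, the belief of a channel that is never observed converges geometrically to $\lambda_0 / (1 - \alpha)$ regardless of its starting value.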
Lemma 2: $V(p)$ is convex in $p_1$ and $p_2$, and the following
inequalities hold:

$$\begin{aligned}
V(cp + (1 - c)p', p_2) &\le cV(p, p_2) + (1 - c)V(p', p_2) \\
V(p_1, cp + (1 - c)p') &\le cV(p_1, p) + (1 - c)V(p_1, p').
\end{aligned}$$

Proof: The convexity property of the value function of
any general POMDP is proved in [11] and we use that
result directly in this paper. ■



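Lemma 3: $V(p_1, p_2) = V(p_2, p_1)$, that is, $V(p)$ is symmetric
with respect to the line $p_1 = p_2$ in the belief space.

Proof: Let $V^n(p_1, p_2)$ denote the expected reward when
the decision horizon spans only $n$ time slots. When $n = 1$,

$$V^1(p_1, p_2) = \max\{(p_1 + p_2)(R_l + C_l) - 2C_l,\ 0,\ p_1(R_h + C_h) - C_h,\ p_2(R_h + C_h) - C_h\}, \tag{16}$$

$$V^1(p_2, p_1) = \max\{(p_1 + p_2)(R_l + C_l) - 2C_l,\ 0,\ p_2(R_h + C_h) - C_h,\ p_1(R_h + C_h) - C_h\}. \tag{17}$$

Obviously we have $V^1(p_1, p_2) = V^1(p_2, p_1)$. Next we assume
$V^k(p_1, p_2) = V^k(p_2, p_1)$, $k \ge 1$, and show that
$V^{k+1}(p_1, p_2) = V^{k+1}(p_2, p_1)$. Since

$$\begin{aligned}
V^{k+1}_{B_b}(p_1, p_2) ={} & (p_1 + p_2)(R_l + C_l) - 2C_l + \beta[p_1 p_2 V^k(\lambda_1, \lambda_1) \\
& + p_1(1 - p_2)V^k(\lambda_1, \lambda_0) + (1 - p_1)p_2 V^k(\lambda_0, \lambda_1) \\
& + (1 - p_1)(1 - p_2)V^k(\lambda_0, \lambda_0)],
\end{aligned} \tag{18}$$

$$V^{k+1}_{B_1}(p_1, p_2) = p_1(R_h + C_h) - C_h + \beta[(1 - p_1)V^k(\lambda_0, T(p_2)) + p_1 V^k(\lambda_1, T(p_2))], \tag{19}$$

$$V^{k+1}_{B_2}(p_1, p_2) = p_2(R_h + C_h) - C_h + \beta[(1 - p_2)V^k(T(p_1), \lambda_0) + p_2 V^k(T(p_1), \lambda_1)], \tag{20}$$

$$V^{k+1}_{B_r}(p_1, p_2) = \beta V^k(T(p_1), T(p_2)), \tag{21}$$

using the assumption that $V^k(p_1, p_2) = V^k(p_2, p_1)$, we have

$$\begin{aligned}
V^{k+1}_{B_1}(p_1, p_2) &= p_1(R_h + C_h) - C_h + \beta[(1 - p_1)V^k(T(p_2), \lambda_0) + p_1 V^k(T(p_2), \lambda_1)] \\
&= V^{k+1}_{B_2}(p_2, p_1).
\end{aligned} \tag{22}$$

Similarly, we have $V^{k+1}_{B_2}(p_1, p_2) = V^{k+1}_{B_1}(p_2, p_1)$,
$V^{k+1}_{B_b}(p_1, p_2) = V^{k+1}_{B_b}(p_2, p_1)$ and
$V^{k+1}_{B_r}(p_1, p_2) = V^{k+1}_{B_r}(p_2, p_1)$. Therefore,

$$\begin{aligned}
V^{k+1}(p_1, p_2) &= \max\{V^{k+1}_{B_b}(p_1, p_2), V^{k+1}_{B_1}(p_1, p_2), V^{k+1}_{B_2}(p_1, p_2), V^{k+1}_{B_r}(p_1, p_2)\} \\
&= \max\{V^{k+1}_{B_b}(p_2, p_1), V^{k+1}_{B_2}(p_2, p_1), V^{k+1}_{B_1}(p_2, p_1), V^{k+1}_{B_r}(p_2, p_1)\} \\
&= V^{k+1}(p_2, p_1).
\end{aligned} \tag{23}$$

Hence we have $V(p_1, p_2) = V(p_2, p_1)$ for all $(p_1, p_2)$ in the
belief space. ■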
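
The induction in the proof can be mirrored numerically by computing the finite-horizon values $V^n$ on a grid of beliefs and checking their symmetry. The following is a rough sketch under stated assumptions: the grid resolution, the bilinear interpolation used to evaluate $V^k$ off the grid, the function names, and the parameter values are all choices made for illustration and are not part of the original analysis.

```python
def bilinear(V, x, y):
    """Bilinear interpolation of V[i][j] ~ V(i/m, j/m), m = len(V) - 1."""
    m = len(V) - 1
    i, j = min(int(x * m), m - 1), min(int(y * m), m - 1)
    tx, ty = x * m - i, y * m - j
    return ((1 - tx) * (1 - ty) * V[i][j] + tx * (1 - ty) * V[i + 1][j]
            + (1 - tx) * ty * V[i][j + 1] + tx * ty * V[i + 1][j + 1])


def action_values(V, p1, p2, lam0, lam1, beta, Rh, Rl, Ch, Cl):
    """One Bellman backup: the four action values built on the table V."""
    t1 = (lam1 - lam0) * p1 + lam0
    t2 = (lam1 - lam0) * p2 + lam0
    v = lambda a, b: bilinear(V, a, b)
    vb = (p1 + p2) * (Rl + Cl) - 2 * Cl + beta * (
        p1 * p2 * v(lam1, lam1) + p1 * (1 - p2) * v(lam1, lam0)
        + (1 - p1) * p2 * v(lam0, lam1) + (1 - p1) * (1 - p2) * v(lam0, lam0))
    v1 = (Rh + Ch) * p1 - Ch + beta * (p1 * v(lam1, t2) + (1 - p1) * v(lam0, t2))
    v2 = (Rh + Ch) * p2 - Ch + beta * (p2 * v(t1, lam1) + (1 - p2) * v(t1, lam0))
    vr = beta * v(t1, t2)
    return {"Bb": vb, "B1": v1, "B2": v2, "Br": vr}


def finite_horizon_values(horizon, n=21, **params):
    """Return the grid table of V^horizon, starting from V^0 = 0."""
    grid = [k / (n - 1) for k in range(n)]
    V = [[0.0] * n for _ in range(n)]
    for _ in range(horizon):
        V = [[max(action_values(V, p1, p2, **params).values()) for p2 in grid]
             for p1 in grid]
    return V


if __name__ == "__main__":
    params = dict(lam0=0.1, lam1=0.9, beta=0.9, Rh=3.0, Rl=2.0, Ch=1.2, Cl=0.8)
    V = finite_horizon_values(horizon=30, **params)
    gap = max(abs(V[i][j] - V[j][i]) for i in range(len(V)) for j in range(len(V)))
    print("largest asymmetry:", gap)
```

The reported asymmetry should be at the level of floating-point rounding for every horizon, in line with Lemma 3.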

B. Properties of the decision regions of policy $\pi^*$

We use $\Phi_a$ to denote the decision region of action $a$. That
is, $\Phi_a$ is the set of beliefs under which it is optimal to take
action $a$:

$$\Phi_a = \{(p_1, p_2) \in ([0,1],[0,1]) \mid V(p_1, p_2) = V_a(p_1, p_2)\}, \quad a \in \{B_b, B_1, B_2, B_r\}. \tag{24}$$



Definition 1: $\Phi_a$ is said to be contiguous along the $p_1$
dimension if, given $(x_1, p_2), (x_2, p_2) \in \Phi_a$, then
$\forall x \in [x_1, x_2]$ we have $(x, p_2) \in \Phi_a$. Similarly,
we say $\Phi_a$ is contiguous along the $p_2$ dimension if, given
$(p_1, y_1), (p_1, y_2) \in \Phi_a$, then $\forall y \in [y_1, y_2]$
we have $(p_1, y) \in \Phi_a$.

Theorem 1: $\Phi_a$ is contiguous in both $p_1$ and $p_2$, where
$a \in \{B_b, B_1, B_2, B_r\}$.

Proof: We prove the result for $\Phi_{B_1}$ as an example; the
results for the other actions can be proved in a similar manner.
First we prove that $\Phi_{B_1}$ is contiguous in $p_1$. Let
$(x_1, p_2), (x_2, p_2) \in \Phi_{B_1}$; we show that
$(cx_1 + (1 - c)x_2, p_2)$ is also in the region $\Phi_{B_1}$, where
$0 \le c \le 1$.

$$\begin{aligned}
V(cx_1 + (1 - c)x_2, p_2) &\le cV(x_1, p_2) + (1 - c)V(x_2, p_2) \\
&= cV_{B_1}(x_1, p_2) + (1 - c)V_{B_1}(x_2, p_2) \\
&= V_{B_1}(cx_1 + (1 - c)x_2, p_2) \\
&\le V(cx_1 + (1 - c)x_2, p_2),
\end{aligned} \tag{25}$$

where the first inequality comes from the convexity in
Lemma 2; the first equality follows from the fact that
$(x_1, p_2), (x_2, p_2) \in \Phi_{B_1}$; the second equality follows
from the affinity of $V_{B_1}(p)$ in $p_1$; and the last inequality
follows from the definition of $V(p)$. We therefore have
$V(cx_1 + (1 - c)x_2, p_2) = V_{B_1}(cx_1 + (1 - c)x_2, p_2)$, that
is, $(cx_1 + (1 - c)x_2, p_2)$ is also in the region $\Phi_{B_1}$.
Therefore, $\Phi_{B_1}$ is contiguous in $p_1$. Similarly,
$\Phi_{B_1}$ is contiguous in $p_2$. ■
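
On a discretized belief grid, the contiguity property of Definition 1 can be checked mechanically. The sketch below assumes a decision region is given as a hypothetical boolean mask `mask[i][j]` (row index for $p_1$, column index for $p_2$); both the representation and the function names are illustrative.

```python
def is_contiguous_1d(flags):
    """True if the True entries of flags form one unbroken run (or none)."""
    idx = [k for k, f in enumerate(flags) if f]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)


def is_contiguous_in_both_dims(mask):
    """Check contiguity along p1 (columns of mask) and p2 (rows of mask)."""
    along_p2 = all(is_contiguous_1d(row) for row in mask)        # p1 fixed
    along_p1 = all(is_contiguous_1d(col) for col in zip(*mask))  # p2 fixed
    return along_p1 and along_p2


if __name__ == "__main__":
    region = [[True, True, False],
              [True, False, False],
              [False, False, False]]
    holed = [[True, False, True],
             [False, False, False],
             [False, False, False]]
    print(is_contiguous_in_both_dims(region), is_contiguous_in_both_dims(holed))
```

A mask that fails this check for a numerically computed policy would point to a discretization or solver problem rather than to a counterexample to Theorem 1.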
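Theorem 2: $\Phi_{B_b}$ and $\Phi_{B_r}$ are self-symmetric with
respect to the line $p_1 = p_2$, that is, if $(p_1, p_2) \in \Phi_a$,
$a \in \{B_b, B_r\}$, then $(p_2, p_1) \in \Phi_a$. $\Phi_{B_1}$ and
$\Phi_{B_2}$ are mirrors of each other with respect to the line
$p_1 = p_2$, that is, if $(p_1, p_2) \in \Phi_{B_1}$ then
$(p_2, p_1) \in \Phi_{B_2}$.

Proof: If $(p_1, p_2) \in \Phi_{B_r}$, then we have

$$V(p_1, p_2) = V_{B_r}(p_1, p_2). \tag{26}$$

Using Lemma 3, we have

$$\begin{aligned}
V_{B_r}(p_2, p_1) &= \beta V(T(p_2), T(p_1)) = \beta V(T(p_1), T(p_2)) \\
&= V_{B_r}(p_1, p_2) = V(p_1, p_2) = V(p_2, p_1),
\end{aligned} \tag{27}$$

hence $(p_2, p_1)$ also belongs to $\Phi_{B_r}$. Similarly, we can
show that if $(p_1, p_2) \in \Phi_{B_b}$, then $(p_2, p_1)$ also
belongs to $\Phi_{B_b}$. If $(p_1, p_2) \in \Phi_{B_1}$, then we have

$$\begin{aligned}
V(p_1, p_2) &= V_{B_1}(p_1, p_2) \\
&= p_1(R_h + C_h) - C_h + \beta[p_1 V(\lambda_1, T(p_2)) + (1 - p_1)V(\lambda_0, T(p_2))].
\end{aligned} \tag{28}$$

Using Lemma 3, we have

$$\begin{aligned}
V_{B_2}(p_2, p_1) &= p_1(R_h + C_h) - C_h + \beta[p_1 V(T(p_2), \lambda_1) + (1 - p_1)V(T(p_2), \lambda_0)] \\
&= p_1(R_h + C_h) - C_h + \beta[p_1 V(\lambda_1, T(p_2)) + (1 - p_1)V(\lambda_0, T(p_2))] \\
&= V_{B_1}(p_1, p_2) = V(p_1, p_2) = V(p_2, p_1),
\end{aligned} \tag{29}$$

hence $(p_2, p_1)$ belongs to $\Phi_{B_2}$, which concludes the
proof. ■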



C. Structure of the optimal policy 

Based on the properties discussed above, we are now ready 
to derive the structure of the optimal policy. 

From the belief updates in (6), (8), (9) and (12), it is clear that
the belief state of a channel is updated to one of the following
three values after any action: $\lambda_0$, $\lambda_1$, or $T(p)$,
where $p$ is the current belief of the channel. For all
$0 \le p \le 1$, $\lambda_0 \le T(p) = \lambda_0 + (\lambda_1 - \lambda_0)p \le \lambda_1$.
Since $0 \le \lambda_0, \lambda_1 \le 1$, the belief space is the
rectangular area determined by the four vertices
$(0,0)$, $(0,1)$, $(1,1)$ and $(1,0)$.

First we consider the four vertices and it is easy to obtain 
the following results. 



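$$\begin{aligned}
V(0,0) &= V_{B_r}(0,0) &&\Rightarrow (0,0) \in \Phi_{B_r} \\
V(0,1) &= V_{B_2}(0,1) &&\Rightarrow (0,1) \in \Phi_{B_2} \\
V(1,0) &= V_{B_1}(1,0) &&\Rightarrow (1,0) \in \Phi_{B_1} \\
V(1,1) &= V_{B_b}(1,1) &&\Rightarrow (1,1) \in \Phi_{B_b}
\end{aligned} \tag{30}$$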



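Next we consider the four edges. On the edge $p_1 = 0$, the
partial value functions are

$$\begin{aligned}
V_{B_1}(0, p_2) &= -C_h + \beta V(\lambda_0, T(p_2)) \\
V_{B_b}(0, p_2) &= p_2(R_l + C_l) - 2C_l + \beta[(1 - p_2)V(\lambda_0, \lambda_0) + p_2 V(\lambda_0, \lambda_1)] \\
V_{B_r}(0, p_2) &= \beta V(\lambda_0, T(p_2)).
\end{aligned} \tag{31}$$

Using our assumptions $C_l < C_h < 2C_l$, $R_h > C_h$, $R_l > C_l$,
and the convexity of the value function $V(p_1, p_2)$, we have
$V_{B_1}(0, p_2) < V_{B_r}(0, p_2)$ and
$V_{B_b}(0, p_2) < V_{B_2}(0, p_2)$. Hence on the edge $p_1 = 0$ only
the two actions $B_2$ and $B_r$ can be optimal. Since
$(0,0) \in \Phi_{B_r}$ and $(0,1) \in \Phi_{B_2}$, we know there
exists a threshold $\rho$ such that $\forall p_2 \in [0, \rho)$,
$(0, p_2) \in \Phi_{B_r}$ and $\forall p_2 \in [\rho, 1]$,
$(0, p_2) \in \Phi_{B_2}$. To derive $\rho$ we define

$$\begin{aligned}
\delta_{1,\lambda_0}(p) &= (1 - p)V(\lambda_0, \lambda_0) + pV(\lambda_1, \lambda_0) - V(T(p), \lambda_0) \\
\delta_{1,\lambda_1}(p) &= (1 - p)V(\lambda_0, \lambda_1) + pV(\lambda_1, \lambda_1) - V(T(p), \lambda_1) \\
\delta_{2,\lambda_0}(p) &= (1 - p)V(\lambda_0, \lambda_0) + pV(\lambda_0, \lambda_1) - V(\lambda_0, T(p)) \\
\delta_{2,\lambda_1}(p) &= (1 - p)V(\lambda_1, \lambda_0) + pV(\lambda_1, \lambda_1) - V(\lambda_1, T(p)).
\end{aligned} \tag{32}$$

From the symmetry of $V(p_1, p_2)$, we have
$\delta_{1,\lambda_0}(p) = \delta_{2,\lambda_0}(p) = \delta_{\lambda_0}(p)$
and
$\delta_{1,\lambda_1}(p) = \delta_{2,\lambda_1}(p) = \delta_{\lambda_1}(p)$.
Using the fact that $V_{B_2}(0, \rho) = V_{B_r}(0, \rho)$, we have

$$V_{B_2}(0, \rho) - V_{B_r}(0, \rho) = 0 \;\Rightarrow\; \rho(R_h + C_h) - C_h + \beta\delta_{\lambda_0}(\rho) = 0, \tag{33}$$

$$\rho = \frac{C_h - \beta\delta_{\lambda_0}(\rho)}{R_h + C_h}. \tag{34}$$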

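Using the results in Theorem 2, we can easily derive a similar
structure on the other three edges. The structure of the optimal
policy on the boundary of the belief space is shown in Fig. 1.
The thresholds $Th_1$ and $Th_2$ in Figure 1 are given by

$$Th_1 = \frac{C_h - \beta\delta_{\lambda_0}(Th_1)}{R_h + C_h}, \qquad Th_2 = \frac{(R_h - R_l) + C_l - \beta\delta_{\lambda_1}(Th_2)}{R_l + C_l}, \tag{35}$$

where $Th_1$ is the threshold $\rho$ of (34) on the edges $p_1 = 0$
and $p_2 = 0$, and $Th_2$ is the threshold between betting and
balanced on the edges $p_1 = 1$ and $p_2 = 1$.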



Fig. 1: Structure of the optimal policy on the boundary of 
belief space 
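
As a quick sanity check on the edge thresholds, consider the myopic special case $\beta = 0$, in which the $\delta$ terms drop out of (34) and (35) and the thresholds reduce to ratios of the reward and loss parameters. The sketch below evaluates these ratios; the function name and the values $R_h = 3$, $R_l = 2$, $C_h = 1.2$, $C_l = 0.8$ are assumptions, and for $\beta > 0$ the thresholds additionally depend on the value function through $\delta$.

```python
def myopic_edge_thresholds(Rh, Rl, Ch, Cl):
    """Edge thresholds when beta = 0 (future rewards ignored).

    Returns (rho, th2):
      rho: on the edge p1 = 0, bet on channel 2 once p2 >= Ch / (Rh + Ch);
      th2: on the edge p1 = 1, use both channels once
           p2 >= (Rh - Rl + Cl) / (Rl + Cl).
    """
    rho = Ch / (Rh + Ch)
    th2 = (Rh - Rl + Cl) / (Rl + Cl)
    return rho, th2


if __name__ == "__main__":
    rho, th2 = myopic_edge_thresholds(Rh=3.0, Rl=2.0, Ch=1.2, Cl=0.8)
    print(f"rho = {rho:.4f}, th2 = {th2:.4f}")
```

Under the assumption $R_h < 2R_l$ the second ratio is always below 1, so the balanced action is reachable on that edge even in the myopic case.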



A simple threshold structure on each edge is clear from 
Figure 1. Next we will derive the structure of the optimal 
policy in the whole belief space. 

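Theorem 3: $\Phi_a$ is a simple connected region extended from
$d_a$ in the belief space $([0,1],[0,1])$, where

$$d_a = \begin{cases}
(0,0) & a = B_r \\
(0,1) & a = B_2 \\
(1,0) & a = B_1 \\
(1,1) & a = B_b
\end{cases}$$

and $\forall (p_1, p_2) \in \Phi_{B_1}$, $p_1 \ge p_2$;
$\forall (p_1, p_2) \in \Phi_{B_2}$, $p_1 \le p_2$.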

Proof: At the beginning of this section we already showed
that $d_a \in \Phi_a$, and from Theorem 1, $\Phi_a$ has at least one
connected region extended from $d_a$. Therefore we next need
to show that each $\Phi_a$ has only one connected region.

For $\Phi_{B_r}$, let $\Phi'_{B_r}$ denote the connected region
extended from $(0,0)$; we will show that there exists no other
connected region $\Phi''_{B_r}$. Since $\Phi'_{B_r}$ is symmetric,
let $([0, th_1], [0, th_1])$ be the minimum rectangle that includes
$\Phi'_{B_r}$ (with no other connected region in this rectangle)
(Figure 2(a)). Suppose there is another connected region
$\Phi''_{B_r}$ in the area $([0, th_1], [th_1, 1])$ or
$([th_1, 1], [0, th_1])$. Taking the former as an example, for any
$(x, y) \in \Phi''_{B_r}$ the line $p_1 = x$ passes across both
$\Phi'_{B_r}$ and $\Phi''_{B_r}$, so at least two separate parts
of $\Phi_{B_r}$ exist on the line $p_1 = x$, which contradicts
Theorem 1. Therefore, no connected region $\Phi''_{B_r}$ exists in
the area $([0, th_1], [th_1, 1])$ or $([th_1, 1], [0, th_1])$.

Suppose another connected region $\Phi''_{B_r}$ exists in
$([th_1, 1], [th_1, 1])$. Let $V_a^{p_2 = y}(p_1)$ denote
$V_a(p_1, p_2)$ when $p_2$ is fixed at the value $y$. From Lemma 1,
the slope of the line $V_{B_2}^{p_2 = y}(p_1)$ is given by

$$\frac{dV_{B_2}^{p_2 = y}(p_1)}{dp_1} = \beta\,\frac{d[(1 - y)V(T(p_1), \lambda_0) + yV(T(p_1), \lambda_1)]}{dp_1}. \tag{36}$$

From (36) we have [...] From the structure of the optimal policy on
the boundary of the belief space, for all $(x, y) \in \Phi''_{B_r}$
we have $V_{B_2}(0, y) > V_{B_r}(0, y)$ (Figure 2(b)).

It is clear from Fig. 2(b) that there exists no $p_1 = x$ such
that $V_{B_2}^{p_2 = y}(x) \le V_{B_r}^{p_2 = y}(x)$. Therefore
$(x, y) \notin \Phi_{B_r}$, which contradicts our assumption that
$(x, y) \in \Phi''_{B_r} \subset \Phi_{B_r}$. This shows that there
is no other connected region $\Phi''_{B_r}$ in the area
$([th_1, 1], [th_1, 1])$. In other words, $\Phi'_{B_r}$ is the only
connected region of $\Phi_{B_r}$.

We can prove that $\Phi_{B_1}$, $\Phi_{B_2}$ or $\Phi_{B_b}$ has only
one connected region in a similar manner; the details are omitted
due to space limits.

Next we prove that $\forall (p_1, p_2) \in \Phi_{B_2}$,
$p_2 \ge p_1$. Obviously $\Phi_{B_2}$ has a connected region extended
from $(0,1)$. If $\exists (x, y) \in \Phi_{B_2}$ with $x > y$, we can
find a mirror point of $(x, y)$ with respect to the line $p_1 = p_2$
according to the convexity of $\Phi_{B_2}$. Then both points
$(x, y)$ and $(y, x)$ belong to $\Phi_{B_2}$, which contradicts
Theorem 2. Hence $\forall (p_1, p_2) \in \Phi_{B_2}$ we have
$p_1 \le p_2$. Similarly, $\forall (p_1, p_2) \in \Phi_{B_1}$,
$p_1 \ge p_2$. ■

(a) 1-threshold structure (b) 2-threshold structure

Fig. 3: The structure of optimal policy $\pi^*$

When we prove the extended region of $\Phi_{B_b}$ in Theorem 3
(refer to [18] for details), two types of structures are found on
the line $p_1 = p_2$. (1) One-threshold structure:
$\exists\, 0 < \rho_1 < 1$ such that $\forall y \in [0, \rho_1]$,
$(y, y) \in \Phi_{B_r}$ and $\forall y \in [\rho_1, 1]$,
$(y, y) \in \Phi_{B_b}$. (2) Two-threshold structure:
$\exists\, 0 < \rho_1 < \rho_2 < 1$ such that
$\forall y \in [0, \rho_1]$, $(y, y) \in \Phi_{B_r}$;
$\forall y \in [\rho_1, \rho_2]$, $(y, y) \in \Phi_{B_1}(\Phi_{B_2})$;
and $\forall y \in [\rho_2, 1]$, $(y, y) \in \Phi_{B_b}$. From
Theorem 3, the structure of the optimal policy is illustrated in
Figure 3.

For the one-threshold structure, $\rho_1$ can be obtained by
solving $V_{B_b}(\rho_1, \rho_1) = V_{B_r}(\rho_1, \rho_1)$. For the
two-threshold structure, $\rho_1$ and $\rho_2$ can be obtained by
solving $V_{B_r}(\rho_1, \rho_1) = V_{B_1}(\rho_1, \rho_1)$ and
$V_{B_1}(\rho_2, \rho_2) = V_{B_b}(\rho_2, \rho_2)$. Therefore, for
this power allocation problem, we are able to give the basic
structure of the optimal policy and derive the thresholds on the
four edges and on the line $p_1 = p_2$. However, so far we are
unable to derive a closed-form expression for the boundary of
each $\Phi_a$. In the next section, we use simulation based on
linear programming to construct the optimal policy and verify
its features.

(a) 1-threshold form (b) 2-threshold form

Fig. 4: Value function and structure of optimal policy

IV. Simulation Based on Linear Programming 

Linear programming is one of the approaches to solve the 
Bellman equation in (4). Based on [13], we model our problem 
as the following linear program: 

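$$\begin{aligned}
\min\ & \sum_{p \in X} V(p) \\
\text{s.t.}\ & g_a(p) + \beta\sum_{y \in X} f_a(p, y)V(y) \le V(p), \quad \forall p \in X,\ \forall a \in A_p,
\end{aligned} \tag{37}$$

where $X$ denotes the belief space and $A_p$ is the set of available
actions for state $p$. The state transition probability $f_a(p, y)$
is the probability that the next state will be $y$ when the current
state is $p$ and the current action is $a \in A_p$. The optimal
policy is given by

$$\pi(p) = \arg\max_{a \in A_p}\Big(g_a(p) + \beta\sum_{y \in X} f_a(p, y)V(y)\Big). \tag{38}$$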

We used the LOQO solver on NEOS Server [14] with 
AMPL input [15] to obtain the solution of equation (37). 
Then we used MATLAB to construct the policy according 
to equation (38). 
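
The policy construction step in (38) can be sketched independently of the solver. In the minimal example below, value iteration on a belief grid is swapped in for the LOQO linear program purely so that the sketch is self-contained; it is not the procedure used for the reported results. The grid size, the bilinear interpolation, the function names, and the parameter values are assumptions.

```python
ACTIONS = ("Bb", "B1", "B2", "Br")


def interp(V, x, y):
    """Bilinear interpolation of the grid table V at belief (x, y)."""
    m = len(V) - 1
    i, j = min(int(x * m), m - 1), min(int(y * m), m - 1)
    tx, ty = x * m - i, y * m - j
    return ((1 - tx) * ((1 - ty) * V[i][j] + ty * V[i][j + 1])
            + tx * ((1 - ty) * V[i + 1][j] + ty * V[i + 1][j + 1]))


def q_values(V, p1, p2, lam0, lam1, beta, Rh, Rl, Ch, Cl):
    """Values of the four actions, ordered as in ACTIONS."""
    t1 = lam0 + (lam1 - lam0) * p1
    t2 = lam0 + (lam1 - lam0) * p2
    v = lambda a, b: interp(V, a, b)
    qb = (p1 + p2) * (Rl + Cl) - 2 * Cl + beta * (
        p1 * p2 * v(lam1, lam1) + p1 * (1 - p2) * v(lam1, lam0)
        + (1 - p1) * p2 * v(lam0, lam1) + (1 - p1) * (1 - p2) * v(lam0, lam0))
    q1 = (Rh + Ch) * p1 - Ch + beta * (p1 * v(lam1, t2) + (1 - p1) * v(lam0, t2))
    q2 = (Rh + Ch) * p2 - Ch + beta * (p2 * v(t1, lam1) + (1 - p2) * v(t1, lam0))
    qr = beta * v(t1, t2)
    return (qb, q1, q2, qr)


def solve_policy(n=21, sweeps=150, **params):
    """Approximate V by value iteration, then take the argmax as in (38)."""
    grid = [k / (n - 1) for k in range(n)]
    V = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        V = [[max(q_values(V, a, b, **params)) for b in grid] for a in grid]
    policy = []
    for a in grid:
        row = []
        for b in grid:
            q = q_values(V, a, b, **params)
            # Ties (e.g. B1 vs. B2 on the diagonal) go to the earlier action.
            row.append(ACTIONS[max(range(4), key=q.__getitem__)])
        policy.append(row)
    return grid, V, policy


if __name__ == "__main__":
    params = dict(lam0=0.1, lam1=0.9, beta=0.9, Rh=3.0, Rl=2.0, Ch=1.2, Cl=0.8)
    grid, V, policy = solve_policy(**params)
    for row in reversed(list(zip(*policy))):  # p2 decreasing from top to bottom
        print(" ".join(row))
```

Printing the policy table gives a coarse picture of the decision regions; a finer grid sharpens the region boundaries at the cost of run time.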



(a) Normalized $\Phi_a$ with increasing $\lambda_0$ (b) Normalized $\Phi_a$ with decreasing $\lambda_1$

Fig. 5: Normalized $\Phi_a$ ($R_h/R_l = 3/2$, $C_h/C_l = 1.2/0.8$)



Figure 4 shows the AMPL solution of the value function
and the corresponding optimal policy. In Fig. 4(a), we use
the following set of parameters: $\lambda_0 = 0.1$,
$\lambda_1 = 0.9$, $\beta = 0.9$, $R_h/R_l = 3/2 = 1.5$,
$C_h/C_l = 1.2/0.8 = 1.5$, and the "1-threshold structure" of the
optimal policy is observed. In Fig. 4(b), we use the same set of
parameters as in (a) except $R_h/R_l = 3.7/2 = 1.85$, and the
"2-threshold structure" of the optimal policy is observed. The
optimal policy in Figure 4 clearly shows the properties we gave in
Section III.

For our power allocation problem, it is interesting to
investigate the effect of the parameters (such as $\lambda_0$,
$\lambda_1$, $R_h$, $R_l$, $C_h$ and $C_l$) on the structure of the
optimal policy. For this purpose, we conducted simulation
experiments with varying parameters. First, we increase $\lambda_0$
from 0.1 to 0.8 while keeping the rest of the parameters the same
as in the experiment in Fig. 4(a). Let $|\Phi_a|$ denote the area
of $\Phi_a$ in the total belief space; we normalize every $\Phi_a$
by the total belief space as $|\Phi_a|/|X|$. Fig. 5(a) shows
how the normalized $\Phi_a$ changes with $\lambda_0$. We can
observe in Fig. 5(a) that initially $\Phi_{B_b}$ has the biggest
area with $\lambda_0 = 0.1$. When $\lambda_0$ is gradually
increased, $\Phi_{B_b}$ becomes smaller, whilst
$\Phi_{B_1}(\Phi_{B_2})$ and $\Phi_{B_r}$ become bigger. When
$\lambda_0 > 0.5$, $\Phi_{B_1}(\Phi_{B_2})$ occupies the major part
of the belief space, meaning that $\lambda_0$ is big enough that it
is better to "gamble" on one channel. Similarly, Fig. 5(b) shows
the results when we decrease $\lambda_1$ from 0.9 to 0.2.
$|\Phi_{B_b}|$ changes in a similar manner as in Figure 5(a). We
can see that only when $\lambda_1$ is as small as 0.3 is
$|\Phi_{B_1}|(|\Phi_{B_2}|)$ bigger than $|\Phi_{B_b}|$.
Interestingly, $|\Phi_{B_r}|$ is always the smallest in both
experiments, which means the system prefers "gambling" to "being
conservative".
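
The normalized area $|\Phi_a|/|X|$ can be estimated from a gridded policy by counting cells. The sketch below assumes a hypothetical policy table given as a list of rows of action labels; the toy table in the usage example is made up and does not correspond to any reported result.

```python
from collections import Counter

ACTIONS = ("Bb", "B1", "B2", "Br")


def normalized_region_areas(policy):
    """Fraction of grid cells assigned to each action, i.e. |Phi_a| / |X|."""
    counts = Counter(action for row in policy for action in row)
    total = sum(counts.values())
    return {a: counts[a] / total for a in ACTIONS}


if __name__ == "__main__":
    # Hypothetical 3 x 3 policy table (rows: p1, columns: p2).
    toy_policy = [["Br", "B2", "B2"],
                  ["B1", "Bb", "Bb"],
                  ["B1", "Bb", "Bb"]]
    print(normalized_region_areas(toy_policy))
```

Cell counting treats every grid point as an equal share of the belief space, so the estimate improves as the grid is refined.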

In Figure 4 we already observed that different $R_h/R_l$ ratios
lead to different structures of the optimal policy (1-threshold or
2-threshold structure). Therefore we believe the optimal policy
is closely related to the immediate reward and loss of the four
actions, and that the ratios $R_h/R_l$ and $C_h/C_l$ have
more effect on the structure of the optimal policy than their
absolute values. Therefore, in the next experiment we increase the
ratio $R_h/R_l$ for different values of $C_h/C_l$.

Figure 6(a)(c)(e) show the normalized $\Phi_{B_b}$, $\Phi_{B_1}$,
$\Phi_{B_r}$ with $R_h/R_l$ increasing from 1.05 to 1.95. We can see
that when $R_h/R_l$ increases, $|\Phi_{B_b}|$ and $|\Phi_{B_r}|$
become smaller while $|\Phi_{B_1}|$ grows bigger, meaning that the
immediate reward of using one channel ($R_h$) is big enough to
justify "gambling". Similarly, Figure 6(b)(d)(f) show the normalized
$\Phi_{B_b}$, $\Phi_{B_1}$, $\Phi_{B_r}$ with $C_h/C_l$ increasing
from 1.05 to 1.95. In contrast to Figure 6(a)(c)(e), when $C_h/C_l$
increases, $|\Phi_{B_b}|$ and $|\Phi_{B_r}|$ grow bigger while
$|\Phi_{B_1}|$ becomes smaller, meaning that the immediate loss of
using a channel ($C_h$) is big enough that the system decides to
"play safe".

(a) $\Phi_{B_b}$ with increasing $R_h/R_l$ (b) $\Phi_{B_b}$ with increasing $C_h/C_l$
(c) $\Phi_{B_1}$ with increasing $R_h/R_l$ (d) $\Phi_{B_1}$ with increasing $C_h/C_l$
(e) $\Phi_{B_r}$ with increasing $R_h/R_l$ (f) $\Phi_{B_r}$ with increasing $C_h/C_l$
Legend: $R_h/R_l = 2.5/2 = 1.25$, $R_h/R_l = 3/2 = 1.5$, $R_h/R_l = 3.5/2 = 1.75$

Fig. 6: Normalized $\Phi_a$ ($\lambda_0 = 0.1$, $\lambda_1 = 0.9$)

From the above observations we understand that the "1-threshold
structure" may occur with small $R_h/R_l$, big $C_h/C_l$ and
$\lambda_1 - \lambda_0$; the "2-threshold structure" may occur with
big $R_h/R_l$, small $C_h/C_l$ and $\lambda_1 - \lambda_0$. Figure 7
verifies our speculation. We can see that in all experiments with a
wide range of parameters, no policy structure other than the
1-threshold and 2-threshold structures is observed. So we can
conclude that with the help of linear-programming simulation, once
the parameters ($\lambda_0$, $\lambda_1$, $R_h$, $R_l$, $C_h$,
$C_l$, $\beta$) are known, the structure of the optimal policy can
be derived as in Figure 7(a)(b).

V. Conclusion 

In this paper we have derived the structure of the optimal policy
for our power allocation problem by theoretical analysis and
simulation. We have given the structure of the optimal policy on
the total belief space and proved that the optimal policy for this
problem has a 1-threshold or 2-threshold structure. With the help
of linear programming, we can derive the optimal policy once the
key parameters are known. In future work, we would like to find a
closed-form expression for the boundary of each action region.
We would also like to investigate the case of non-identical
channels as in [16], or derive useful results for more than two
channels.

(a) 1-threshold structure (b) 2-threshold structure

Fig. 7: (a) $R_h/R_l = 1.25$, $C_h/C_l = 1.95$, $\lambda_0 = 0.1$, $\lambda_1 = 0.9$;
(b) $R_h/R_l$ = [illegible], $C_h/C_l = 1.5$, $\lambda_0 = 0.4$, $\lambda_1 = 0.6$